Skip to content

ENH: Automatically preserve links in added pages #3298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 22, 2025

Conversation

larsga
Copy link
Contributor

@larsga larsga commented May 27, 2025

Here is a draft implementation of the first stage of the issue #3290 implementation. It handles links in pages added via add_page and insert_page, but it doesn't handle pages merged into those pages before adding.

Does this look OK?

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

@larsga larsga force-pushed the issue-3290 branch 2 times, most recently from d0b2c8a to 460139e Compare May 27, 2025 13:59
Copy link

codecov bot commented May 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.91%. Comparing base (bfe7178) to head (d6d47e2).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3298      +/-   ##
==========================================
+ Coverage   96.89%   96.91%   +0.02%     
==========================================
  Files          54       55       +1     
  Lines        9263     9324      +61     
  Branches     1695     1706      +11     
==========================================
+ Hits         8975     9036      +61     
  Misses        172      172              
  Partials      116      116              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@larsga larsga force-pushed the issue-3290 branch 2 times, most recently from fb8c123 to 7e394e4 Compare May 27, 2025 14:10
@stefan6419846
Copy link
Collaborator

Thanks for the PR.

Does this look OK?

At first sight it looks okay, but unless there are specific aspects to talk about, I tend to prefer a proper review once the automated checks were successful.

It seems like CI is still failing due to coverage, typing and code style issues - this is something to consider in a second step. Nevertheless, regarding the code style, I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

I'm wondering if some users may already have written their own link patching code -- will this code break theirs? If so, should we make it possible to turn this behaviour off somehow?

In theory, each release could break the code of some user when doing special stuff. As this fixes shortcomings of the current implementation without removing anything, I do not see the need to introduce a deprecation period for this at the moment.

At the moment I'm resolving everything by searching lists to find corresponding indirect references. It would be much faster with a hash, but I haven't been able to make that work. Thoughts?

What exactly have you tried and what has been the result?

@larsga
Copy link
Contributor Author

larsga commented May 27, 2025

I tend to prefer a proper review once the automated checks were successful.

Yeah, sorry. I had to cook dinner, and now I have a meeting. I didn't intend to leave the build broken like this. The trouble is I can't get the right version of ruff on my laptop right now, so all the checks end up being done in CI, which is slow. Anyway, I will sort all this out.

I would prefer to move the new classes into a submodule of pypdf.generic to not bloat the pypdf._writer module where possible.

Will do.

I do not see the need to introduce a deprecation period for this at the moment.

Ack! 👍

What exactly have you tried and what has been the result?

Mainly that I couldn't get the reference lookups to work, but I take this to mean that they should work. I'll work on this a bit more.

@larsga larsga force-pushed the issue-3290 branch 15 times, most recently from 7274240 to 09ea9b0 Compare May 28, 2025 13:05
@larsga
Copy link
Contributor Author

larsga commented May 28, 2025

Now it should finally be ready for review.

Note that I changed the type of the Destination.page property. As far as I can tell it's been wrong all the time. I certainly don't get an int back when I reference it. I get an IndirectObject, and that's also what the docs say should be passed in as the value. I was forced to do this to get type checking to accept my code -- let me know if you want this change separated out.

Once this PR is merged I'll look at handling links in merged-in pages, but because of upcoming holiday that will take a while.

@stefan6419846
Copy link
Collaborator

Thanks. I will try to have a look at this as soon as possible - this might take some time as well.

Regarding the broken type hints, there is a corresponding issue as well: #3233.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I just had a first look at the changes and added some small comments. As general notes:

  • Using abbreviations in the names and docstrings should be avoided. It is completely fine to use "reference" instead of "ref" for example to improve clarity and avoid having to deprecate stuff later on.
  • In the type hints, please use type1, type2 instead of type1,type2. I have marked some cases, but not all.
  • Instead of nesting functions and bloating the already large modules, consider moving the corresponding functionality to the new submodule.

@larsga larsga force-pushed the issue-3290 branch 5 times, most recently from cd41ccd to c4486dd Compare June 18, 2025 09:09
@stefan6419846
Copy link
Collaborator

Thanks for your patience - larger PRs do not play nicely when there is lots of other stuff to do from my side.

I did some small corrections myself and added a comment. Currently, there is a merge conflict as well. It this is resolved, I will have a final look at it and try getting it merged.

@larsga
Copy link
Contributor Author

larsga commented Jul 21, 2025

Your review came in just as I left for holiday, so this took a while.

I managed to resolve the conflicts, so I think this is ready now.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries for the delay - larger PRs might take some more time from my side as well.

I just did some basic formatting changes and it looks ready to be merged for me now.

@stefan6419846 stefan6419846 merged commit c17f03a into py-pdf:main Jul 22, 2025
16 checks passed
@larsga
Copy link
Contributor Author

larsga commented Jul 22, 2025

Fantastic to have this merged! Although this PR does not actually solve our problem. :) The next step is to also support link rewriting when pages have been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants